Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: apply to_tsvector to the filtered columns when a fts operator is used #3845

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

laurenceisla
Copy link
Member

@laurenceisla laurenceisla commented Jan 14, 2025

Closes #2255

When an fts operator is used, then it applies to_tsvector to the filtered column. For example:

/table?column=fts(simple).test

is equivalent to this in SQL:

SELECT *
FROM table
WHERE to_tsvector('simple', column) = to_tsquery('simple', 'test')
If the `totsv` modifier (short for `to_tsvector`) is used for `fts` operators then it applies `to_tsvector` to the filtered column. For example:
/table?column=totsv.fts(simple).test

Is equivalent to this in SQL:

SELECT *
FROM table
WHERE to_tsvector('simple', column) = to_tsquery('simple', 'test')

@laurenceisla laurenceisla force-pushed the feat-totsv branch 2 times, most recently from b9cee18 to b8f8db2 Compare January 14, 2025 16:14
@laurenceisla laurenceisla marked this pull request as ready for review January 14, 2025 16:16
CHANGELOG.md Outdated
@@ -26,6 +26,7 @@ This project adheres to [Semantic Versioning](http://semver.org/).
- #3795, Clarify `Accept: vnd.pgrst.object` error message - @steve-chavez
- #3779, Always log the schema cache load time - @steve-chavez
- #3706, Fix insert with `missing=default` uses default value of domain instead of column - @taimoorzaeem
- #2255, Fix `to_tsvector()` not applying explicitly to the full-text search filtered column (on non `tsvector` types) - @laurenceisla
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure if this should be considered a new feature or a fix. I think the user expects the to_tsvector to be applied under the hood, so I'm adding it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue says "enhancement", the PR title says "feat:" - and yet you conclude to add it to the "fixed" section :D

I'd look at this from the angle of backporting. I wouldn't backport it, at least in its current form. I understand you're applying this transformation to every column except those with type tsvector right now. This does have a considerably risk of breaking indexes, we only need to go back as far as when we did something similar with to_jsonb(...) around everything to allow accessing fields or composite types or so. It did break indexes.

Now.. what if we only apply this transformation for json/jsonb types? That would be a lot less risky, in which case we could think about backporting.

Although... even in this case, I'd still argue that this is more of a feature than a fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the PR title says "feat:"

Oh woops, I left that from the previous iteration (adding the totsv modifier looked more like a feat in that case for me). Yeah, I get it now.

I understand you're applying this transformation to every column except those with type tsvector right now. This does have a considerably risk of breaking indexes, we only need to go back as far as when we did something similar with to_jsonb(...) around everything to allow accessing fields or composite types or so

Yes, although there's a difference in that the to_tsvector() is already applied implicitly to the text types with the @@ operator.

text @@ tsquery → boolean

Does text string, after implicit invocation of to_tsvector(), match tsquery?

'fat cats ate rats' @@ to_tsquery('cat & rat') → t

The other types will still fail but with a different error message, e.g. function to_tsquery(integer) does not exist instead of operator does not exist: integer @@ tsquery.

Copy link
Member

@wolfgangwalther wolfgangwalther Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other types will still fail but with a different error message, e.g. function to_tsquery(integer) does not exist instead of operator does not exist: integer @@ tsquery.

Which kind of brings me back to my original question: #2255 (comment)

Did anyone try creating the missing @@ operator for jsonb with a function that does this to_tsvector call explicitly?

Like.. do we actually need this fix feature, or could we just advise people to create the respective cast operator?

Copy link
Member Author

@laurenceisla laurenceisla Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does have a considerably risk of breaking indexes,

In fact, I think it's the contrary, actually. I've been checking more about FTS indexes and the docs say:

Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index. That is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b cannot. This ensures that an index will be used only with the same configuration used to create the index entries.

As it is right now if we a GIN index to a column of text type, PostgREST won't use it. This change (appying to all types, not only JSON) would fix that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to read your conclusion out of this.

However, I read something else:

Another approach is to create a separate tsvector column to hold the output of to_tsvector. To keep this column automatically up to date with its source data, use a stored generated column.

It seems to me that suggesting to our users to create generated columns with the tsvector output of the json column would solve the problem nicely. Then this generated column can be searched - without any additional operators or to_tsvector wrappers. It also scales much better, because this solution will easily allow users to add those config arguments ("english" etc.) or create the tsvector column over multiple columns.

If instead, we implement as proposed here, I see the next feature request coming very soon: "I need to pass 'english' to the two-argument-version of tsvector, please let PostgREST do that for me."

Copy link
Member Author

@laurenceisla laurenceisla Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like.. do we actually need this fix feature, or could we just advise people to create the respective cast operator?

It seems to me that suggesting to our users to create generated columns with the tsvector output of the json column would solve the problem nicely.

Yes, that'd be another solution. We'd need to mention that it won't work as expected for text or json columns and to use tsvector generated columns and give the suggestions you list. Nonetheless, I think having an out of the box option to use would be a good experience.

If instead, we implement as proposed here, I see the next feature request coming very soon: "I need to pass 'english' to the two-argument-version of tsvector, please let PostgREST do that for me."

Do you mean to_tsvector(config_language, text)? Cause... that's what it's implemented in this PR: col=fts(english).val in SQL would be to_tsvector('english', col) @@ to_tsquery('english', 'val') (I updated the original post of the PR to reflect this). I figured it made sense having the same config languages in the tsquery and the tsvector, so using the one in fts(language) for both shouldn't be a probem.


All in all, I'm on the fence here. I see the value in both having the implementation out of the box and recommending generated columns as an alternative and also for complex cases (computed fields are not included in the PR, for instance).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to_tsvector(config_language, text)? Cause... that's what it's implemented in this PR: col=fts(english).val in SQL would be to_tsvector('english', col) @@ to_tsquery('english', 'val') (I updated the original post of the PR to reflect this). I figured it made sense having the same config languages in the tsquery and the tsvector, so using the one in fts(language) for both shouldn't be a probem.

Ah. I didn't look at the code, sorry. And apparently not at the PR description either :D.

OK, now I understand your comment about indexing better as well. And then I see why you'd want to apply this not only to json, but also to text, for consistency.


All in all, I'm on the fence here. I see the value in both having the implementation out of the box and recommending generated columns as an alternative and also for complex cases (computed fields are not included in the PR, for instance).

Seems like we should maybe do both. This PR and a better explanation of the generated column approach for special cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should maybe do both. This PR and a better explanation of the generated column approach for special cases?

Agree, opened #3870 to track this. Meanwhile this PR should be ready for review/merging.

Comment on lines 289 to +290
res = fromMaybe (unknownField fn jp) $ HM.lookup qi tables >>=
Just . flip resolveTableFieldName fn
Just . (\t -> resolveTableFieldName t fn toTsV)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a performance concern here: #2255 (comment)

I don't think there will be a difference in performance since it's using this lookup that is already applied for all filters before this PR.

@laurenceisla laurenceisla changed the title feat: add "totsv" modifier to the "fts" operator feat: apply to_tsvector to the filtered columns when a fts operator is used Jan 21, 2025
@@ -20,6 +21,7 @@ import PostgREST.SchemaCache.Identifiers (FieldName)
import Protolude

type TransformerProc = Text
type ToTsVector = Maybe (Maybe Text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing at first glance, why a double Maybe?

@@ -193,7 +193,7 @@ The :code:`fts` filter mentioned above has a number of options to support flexib

curl "http://localhost:3000/tsearch?my_tsv=not.wfts(french).amusant"

Using `websearch_to_tsquery` requires PostgreSQL of version at least 11.0 and will raise an error in earlier versions of the database.
This also works with columns of text and JSON types, converting them to ``tsvector`` accordingly.
Copy link
Member

@steve-chavez steve-chavez Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A subsection would be better for this. This way we can also highlight the feature on the release notes.

Maybe "automatic tsvector casting"`, "automatic to_tsvector??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Change FTS operators to explicitly call to_tsvector()
3 participants